The widespread prevalence of COVID-19 has urged countries to eradicate the virus by vaccinating their citizens. However, such a large-scale vaccination plan has encountered hesitancy and opposition from many people. The fear, doubts, and suspicions around the vaccine have appeared not only among the general citizens but also among the medical workers such as doctors and nurses (Browne et al. 2021). For instance, a survey shared by the Centers for Disease Control and Prevention found that 63% of surveyed health care workers reported they would not be willing to accept a COVID-19 vaccination (Gorfinkel and Perlow 2022). Given this, we aim to answer the research question: ‘Why are people unwilling to take or even against COVID-19 vaccines?’ In doing so, we hope to find some variables that can be useful in improving vaccination rates.
The Oxford English Dictionary defines an anti-vaxxer as “a person who is opposed to vaccination” (Stevenson 2010). Some researchers have also characterized vaccine hesitancy as a more polite term for anti-vaccination (Razai et al. 2021). However, other researchers argue that anti-vaxxers and those with vaccine hesitancy should be considered two different groups - the former promote reductionist, broad-brush statements and ignore science, while the latter expect a scientific response (Berman 2020; Gorfinkel and Perlow 2022). In the current research, we follow the first position and treat anti-vaccination and vaccine hesitancy as the same concept. Therefore, we use these two terms interchangeably in this project.
Our current research provides an answer to the main research question by exploring two sub-questions, using two types of data (i.e. Twitter data and survey data).
In Study 1, with Twitter data from 1 October 2020 to 31 March 2021, we explore which topics are associated with anti-vaccination sentiment using an unsupervised topic model, constructed with the Latent Dirichlet Allocation (LDA) method. This analysis offers a glimpse of the characteristics of anti-vaccination sentiment - namely, where the negative sentiments about COVID-19 vaccination arise. With the insights gathered from Study 1, we further investigate the variables that may contribute to people’s hesitation toward vaccination through survey data collected from the EU-27. Two supervised machine learning methods, decision tree and random forest, will be employed.
As this survey data was collected between 15 February 2021 and 30 March 2021, we can combine its insights with those from the Twitter data to obtain a more continuous inference. Additionally, the survey data provides responses to more specific questions regarding one’s attitude towards COVID-19 vaccines and more detailed personal characteristics of the respondents (e.g., psychological profile, socioeconomic and demographic information), which are not available in the Twitter data. This richness helps us build a better classification model to discover what makes one hesitant to vaccinate.
In conclusion, we explore two different types of ‘big’ data - Twitter data and (big) survey data - throughout this research. Having learned about the strengths and weaknesses of the two from the literature review (Assignment 1), we analyze both types of data with the main research question in mind. We hope our research can shed light on the current issue of vaccine hesitancy and help us understand the reasons behind people’s reluctance or opposition to the COVID-19 vaccines.
1 Table of Contents
2 Study 1: NLP on Twitter Data
The data sets come from Muric, Wu, and Ferrara (2021), who have been collecting anti-vaccination Twitter data since the beginning of the COVID-19 outbreak. Two types of data have been collected and made publicly available: streaming data and account data. The former contains tweets collected via keywords related to anti-vaccination, while the latter targets anti-vaxxer accounts and contains all the posts of those accounts. Between the two, we work with the streaming data from October 2020 to March 2021. Although the account data might give more insight into the general characteristics of anti-vaccination sentiment, it is too large for the computational power available to us. Moreover, because it includes every tweet certain accounts posted, it contains a lot of noise, which can lead to undesirable results for the research question. Hence, the streaming data is preferred: it involves smaller data sets and keeps the focus on the characteristics of anti-vaccination sentiment.
In the following analysis, we use the streaming data from October 2020 to March 2021, the first six-month period of collected data and the period in which COVID-19 vaccines were being tested and prepared for distribution (Research 2022). We selected these data sets assuming that anti-vaccination sentiment was expressed more strongly on social media at the beginning of distribution than later.
Due to the large data size, we first focus on the tweets of October and November 2020; then we delve into the rest of the data sets. Note that, despite the large number of observations, we do not use a sampling method: each observation contains at most 280 characters, and we did not encounter any problems running the following code with the amount of data we have.
2.1 Setup
2.1.1 Packages
library(glue)
library(dplyr)
library(tidyverse) #need 'stringr' package which is included in 'tidyverse'; general utility & workflow functions
library(tidytext) # tidy implementation of NLP methods
library(ggplot2)
library(gridExtra) # for ggplot arrangement
library(tibble)
library(broom)
#Tokenization
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
#Lemmatization
library(textstem)
#Natural Language Processing
library(tm) # general text mining functions, making document term matrices
library(seededlda)
library(DT)
library(rtweet) # To get emojis dataset
2.1.2 Loading Data
First, we download all the data and divide them into separate data frames per month. Following Twitter’s privacy policies, only Tweet IDs have been shared in the Git repository of Muric, Wu, and Ferrara (2021). Therefore, we used Hydrator to hydrate the obtained Tweet IDs and created CSV files.
setwd("~/serene2/cabd-group31")
data_all <- lapply(Sys.glob(paths = "202*-*-merged.csv"), read.csv)
oct <- as.data.frame(data_all[1])
nov <- as.data.frame(data_all[2])
dec <- as.data.frame(data_all[3])
jan <- as.data.frame(data_all[4])
feb <- as.data.frame(data_all[5])
mar <- as.data.frame(data_all[6])
data <- rbind(oct, nov) # Only use oct&nov due to big size for the first part
The data contain 35 variables (e.g., coordinates, hashtags, urls, id). As we only need the contents that people posted, we keep only the tweet text.
tweets <- data[,'text']
2.2 LDA on Oct & Nov Data Sets
Reference: some codes from “Lecture 5 - Exercises and Solutions” are adopted in the following parts.
In this section, we perform unsupervised topic modeling on the October and November data sets with the LDA method. First, we preprocess and normalize the texts so that the textual data can be easily digested by the LDA algorithm. Second, we explore the frequently appearing terms, which hint at the possible topics emerging from the data. Lastly, we present the LDA results together with their visualization.
2.2.1 Preprocessing
2.2.1.1 Data Cleaning
Reference: https://stringr.tidyverse.org/articles/regular-expressions.html
Tweets contain contractions, abbreviations, emojis, and non-alphanumeric characters, inter alia, that prevent the topic model from yielding interpretable results. To focus on what really contributes to forming topics, we discard the following types of text:
ID-Tags and Hashtags
URL
Emojis and Punctuations
Non-alphabetic letters (e.g., Han, Hangul, Hiragana, Katakana, Arabic, Devanagari, etc.)
Numbers
HTML codes and other trailing white space
The following function removes the majority of the words and characters that are not important for the analysis. However, given that many languages use the Latin alphabet and we do not limit the scope to a specific location, non-English terms may still appear prominently in the topic analysis. Those we will have to remove manually after running the code and spotting the frequently appearing non-English words.
# Function to remove unnecessary words/letters
clean <- function(contents){
  # Unify curly/straight quotes into straight ones
  contents <- gsub("[“”]", "\"", gsub("[‘’]", "'", contents))
  contents <- iconv(contents, "latin1", "ASCII", sub = "") # drop non-Latin alphabets
  contents %>%
    tolower() %>%                               # to lower case
    str_remove_all("(^|\\s)([@#][\\w_-]+)") %>% # remove id-tags & hashtags
    str_remove_all("(https?://\\S+)") %>%       # remove links
    str_replace_all("'s", " ") %>%              # remove 's, e.g., today's
    str_replace_all("[:punct:]", " ") %>%       # remove punctuation
    str_replace_all("~", " ") %>%               # [:punct:] doesn't remove tilde
    str_replace_all("[:digit:]", " ") %>%       # remove digits
    str_remove_all("[:emoji:]") %>%             # no more emojis (just in case)
    str_replace_all("[^A-Za-z0-9]", " ") %>%    # non-alphanumeric characters
    str_replace_all("\\s+", " ") %>%            # collapse whitespace (incl. newlines)
    trimws()                                    # trim leading/trailing whitespace
}
# Apply the function to the data
cleaned_tweets <- clean(tweets)
# Check whether the function does a good job
# sum(str_detect(cleaned_tweets, "निर्मल") == TRUE)
head(cleaned_tweets, 10)
## [1] "who has been always helping in spreading infectious diseases in order to do vaccination business and always exposed just like swine flu this will also get exposed agenda"
## [2] "the adjustment bureau will love to conquer all hate movieclips facing the powerful agents of fate david norris must either accept or defy fate to be with his lover elise sellas"
## [3] "vaccines"
## [4] "sane good police in port coquitlam amp bc i need your help to get out of poisoned drug contaminated homes real trudeau amp qais need your help it about illuminati depopulation amp cov insanity"
## [5] "put today times better then anyone else i can think of"
## [6] "who has been always helping in spreading infectious diseases in order to do vaccination business and always exposed just like swine flu this will also get exposed agenda"
## [7] "you are so full of shit those vaccines are going to kill and you know it all part of your depopulation plan who the hell made you god"
## [8] "hero doctors speaking out against the lies and happening all around the world wake up people fight back against the lies amp restrictions"
## [9] "everyone that believes in eugenics and depopulation should full send and kill themselves for their cause talk the talk but won t walk the walk"
## [10] "who has been always helping in spreading infectious diseases in order to do vaccination business and always exposed just like swine flu this will also get exposed agenda"
# Adding the cleaned text into the dataframe
data$cleaned_tweets <- cleaned_tweets
2.2.1.2 Creating Corpus
Once we have a cleaned data set, we create a corpus of the textual data, to which we will apply a language model later.
corp_tweets <- corpus(data, text_field = "cleaned_tweets")
2.2.1.3 Lemmatization
With the corpus, we lemmatize the words: e.g., better becomes good, has becomes have, and so on.
lem_tweets <- lemmatize_strings(corp_tweets)
2.2.1.4 Tokenization
Once lemmatized, we tokenize the corpus into words. Here, we remove all the stop words. Often stopwords("english") is used, but we opted for SMART as it contains more stop words. As mentioned in Section 2.2.1.1, we also added other words that kept appearing even after removing the stop words, such as bc (an abbreviation of because), don (part of don’t), and the French articles de and la.
Note that the stop words include negations - e.g., don’t, not - which might matter for sentiment analysis; for topic modeling, however, it is preferable to remove them.
# Tokenization
toks_tweets <- quanteda::tokens(lem_tweets, what="word", remove_numbers = T, remove_symbols = T)
# stopwords("english")
# stopwords("SMART") # It has more words than "english"
toks_tweets_nostop <- tokens_remove(toks_tweets, c(stopwords("SMART"), "bc", "don", "isn", "aren", "de", "la"))
# head(toks_tweets_nostop)
2.2.2 Exploratory Analysis
2.2.2.1 Frequent Words
With the preprocessed data, we can already look at the frequent words. This gives a first impression of the topics that may emerge from the tweets we gathered.
# Function for word frequency
word_fq <- function(toks){
fq <- as_tibble(unlist(toks))
fq %>%
group_by(value) %>%
summarise(n = n()) %>%
arrange(desc(n))}
DT::datatable(word_fq(toks_tweets_nostop), class = "display", rownames = FALSE, options = list(pageLength = 5, autoWidth = FALSE))
The top 3 occurring words are depopulation, vaccine, and amp. Note that amp is most likely residue of the HTML entity &amp;amp; (i.e., the ampersand “&”) left over after cleaning, rather than a meaningful term (though Urban Dictionary suggests it may stand for rich, money, power (Sluggie420 2016)). Linking words like gate (which probably comes from Bill Gates) with agendum and plan, we can already guess that many tweets concern the alleged manipulation of the population by elites in order to depopulate the world.
2.2.2.2 Unigrams + WordCloud
The frequent words can be visualized as a word cloud.
# Unigrams
unigram_toks <- tokens_ngrams(toks_tweets_nostop, n=1)
unigram_dfm <- dfm(unigram_toks)
unigram_freq <-textstat_frequency(unigram_dfm)
# Plot wordcloud to show most frequent words
textplot_wordcloud(unigram_dfm, max_words = 200, ordered_color = TRUE)
- The most frequent word is depopulation, and then vaccine and amp follow, as we saw in the table above.
2.2.2.3 Bigrams + WordCloud
Frequent words, or unigrams, consider each word individually. However, sometimes it is more meaningful to look at a set of words together. For instance, seeing Bill alone in the unigram word cloud, one might think of tweets about a new law or an amount of money owed to someone, while seeing Bill together with Gates gives a different interpretation. For this reason, we look at the bigrams and the word cloud they create.
# Bigrams
bigram_toks <- tokens_ngrams(toks_tweets_nostop, n=2)
bigram_dfm <- dfm(bigram_toks)
bigram_freq <-textstat_frequency(bigram_dfm)
# bigram_freq
# Plot wordcloud to show most frequent words
textplot_wordcloud(bigram_dfm, max_words = 100,
min_size = 0.7, max_size = 3,
ordered_color = TRUE)
Bill_Gate, Depopulation_Agendum (i.e., the lemma of Agenda), and Great_Reset are the top 3 most frequent bigrams. Again, we can interpret that there is a sentiment that ‘elites such as Bill Gates are trying to depopulate the world.’
2.2.2.4 Trigrams
Trigrams are also possible, for the same reason as bigrams. Here, we did not produce a word cloud since it took long to render and the result did not seem as insightful as the bigrams. Instead, we look at the frequency table.
# Trigrams
trigram_toks <- tokens_ngrams(toks_tweets_nostop, n=3)
trigram_dfm <- dfm(trigram_toks)
trigram_freq <-textstat_frequency(trigram_dfm)
DT::datatable(trigram_freq, class = "display", rownames = F, options = list(pageLength = 5, autoWidth = FALSE))
- While a similar conclusion is possible, we also see terms like slave_british_colony and depopulation_hap_obigbo. Although it is difficult to grasp the full picture from key words alone, it appears that some people linked the depopulation conspiracy around COVID-19 to the arrest of Obigbo residents by soldiers in Nigeria (see David 2020).
2.2.2.5 Document-feature Matrix Plot
The document-feature matrix offers another way to explore the frequent words. As with the word clouds, we can create these plots for both uni- and bigrams.
# Plot it
unigram_dfm %>% # Unigram dfm created before
textstat_frequency(n = 30) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
coord_flip() +
labs(x = NULL, y = "Frequency") +
theme_minimal()
# With Bigram (More informative)
bigram_dfm %>%
textstat_frequency(n = 30) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
coord_flip() +
labs(x = NULL, y = "Frequency") +
theme_minimal()
2.2.3 LDA (Unsupervised Text Modelling)
Now we apply the LDA algorithm to the preprocessed text data, using the document-feature matrix we already created in Section 2.2.2.2. Moreover, we trim the document-feature matrix so that words mentioned too often or too rarely are excluded from the analysis; an extremely high or low frequency implies low importance to the content. Given the character limit on tweets and the large number of ‘documents’ (i.e., tweets), we only apply limits based on the proportion of documents each term occurs in, via the arguments min_docfreq and max_docfreq, instead of limiting the exact number of occurrences via min_termfreq / max_termfreq. Accordingly, we keep the terms that occur in at least 1 in 100 documents and at most 4 in 5 documents.
dfmtrim <- unigram_dfm %>%
dfm_trim(min_docfreq = 0.01, max_docfreq = 0.8, docfreq_type = "prop")
set.seed(0)
tweets_lda <- textmodel_lda(dfmtrim, k = 10) # 10 topics
DT::datatable(terms(tweets_lda), class = "display", rownames=FALSE, options = list(pageLength = 10, autoWidth = FALSE))
The LDA yields phi and theta: the former gives the distribution of words for each topic k, and the latter the distribution of topics for each document i. We will use the phi values to visualize which terms are most strongly associated with each topic.
# tweets_lda$phi # between topic and words
# tweets_lda$theta # between each text and topic
# class(tweets_lda$phi)
lda_results <- as.data.frame(t(tweets_lda$phi))
DT::datatable(lda_results, class = "display", rownames = TRUE, options = list(pageLength = 5, autoWidth = FALSE))
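To make the roles of phi and theta concrete, here is a minimal base-R sketch with toy matrices (not the fitted model): phi is K topics × V words, theta is D documents × K topics, and each row is a probability distribution.

```r
set.seed(1)
K <- 2; V <- 4; D <- 3                          # toy sizes, not the real model
normalize <- function(m) m / rowSums(m)         # make each row sum to 1
phi   <- normalize(matrix(runif(K * V), K, V))  # word distribution per topic
theta <- normalize(matrix(runif(D * K), D, K))  # topic distribution per document
stopifnot(all(abs(rowSums(phi)   - 1) < 1e-8))
stopifnot(all(abs(rowSums(theta) - 1) < 1e-8))
# The expected word distribution of document 1 is a mixture of the topic word
# distributions, weighted by that document's theta row:
doc1_words <- theta[1, , drop = FALSE] %*% phi  # 1 x V, also sums to 1
stopifnot(abs(sum(doc1_words) - 1) < 1e-8)
```

This is why transposing phi (as above) gives a words-by-topics table that is convenient for plotting top terms per topic.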
2.2.3.1 Plotting
Reference: Tatman, R (https://www.kaggle.com/code/rtatman/nlp-in-r-topic-modelling)
Based on the phi values from the LDA, we plot the terms with the highest weight for each topic.
results2 <- as_tibble(lda_results, rownames = NA) %>%
rownames_to_column(var = 'words')
top_results <- function(results2){
for (topic in colnames(results2[2:11])) {
top <- results2[, c('words', topic)] %>%
slice_max(results2[, topic], n=10)
# top <- results2[, c('words', topic)] %>%
# top_n(10, results2[, topic])
assign(paste0('top', topic), top, envir = parent.frame())}
}
top_results(results2)
# Plotting per topic
# Errors occurred when using a bare for-loop with ggplot (lazy evaluation of
# aes()), so we collect the per-topic data frames in a list and build the
# plots with lapply, where each iteration gets its own environment.
plt_fn <- function(toptopic1, toptopic2, toptopic3, toptopic4, toptopic5,
                   toptopic6, toptopic7, toptopic8, toptopic9, toptopic10){
  tops <- list(toptopic1, toptopic2, toptopic3, toptopic4, toptopic5,
               toptopic6, toptopic7, toptopic8, toptopic9, toptopic10)
  plots <- lapply(seq_along(tops), function(i){
    ggplot(data = tops[[i]], aes(words, .data[[paste0("topic", i)]])) +
      geom_col(fill = "orange", show.legend = FALSE) + # as a bar plot
      labs(x = NULL, y = NULL, title = paste("Topic", i)) +
      coord_flip()
  })
  grid.arrange(grobs = plots, nrow = 2, newpage = TRUE)
}
plt_fn(toptopic1, toptopic2, toptopic3, toptopic4, toptopic5, toptopic6, toptopic7, toptopic8, toptopic9, toptopic10)
First, we see that the term depopulation appears in all the topics except Topic 10. Focusing on the unique and frequent words, the following terms can describe each topic:
- Topic 1: Million, People, Good, Life, Good, Die
- Topic 2: World, Global, Depopulation, Plan, Surveillance
- Topic 3: Depopulation, Kill, Game, Lock-down, Systematic, Torture
- Topic 4: Bill Gates, Depopulation, DNA, Vaccine
- Topic 5: Vaccine, Covid, Mask, Test, Force
- Topic 6: AMP, Agendum, NWO*, Elite, Evil, Mass, Control
- Topic 7: People, Great, Depopulation, Climate Change, Resource, Reset
- Topic 8: Depopulation, Agendum, Numb, Happy, Truth, Slave, British Colony
- Topic 9: Depopulation, Mass, Control, Health, World
- Topic 10: Vaxxed, Vax, Child, Find, Study
*NWO means “a conspiracy theory which hypothesizes a secretly emerging totalitarian world government” (Wikipedia n.d).
Given that negations such as not and ’t were removed as stop words, and given the general trend of the terms appearing in the plots, even some positive terms like good and happy cannot be interpreted positively. Generally, the message that ‘COVID-19 and COVID-19 vaccines are part of the depopulation agenda of the elites’ resonates through all the topics. On top of that, Topic 3 and Topic 5 reflect people’s negative view of the COVID measures, including mask wearing, lock-downs, and more. Such a negative view of the measures seems to be linked to the belief that global elites are trying to control people. Looking at Topic 7, we also notice that some may relate COVID-19 to a depopulation agenda together with climate change, as these terms appear in the same topic. Despite minor differences in emphasis, many terms reappear across topics, and the extracted topics form a coherent theme for the tweets of our interest - namely, conspiracy theories.
2.3 LDA with All Data
Since we did not see very distinct topics in the October and November 2020 data, we aim to verify whether this is still the case when we extend the scope of the data all the way to March 2021. Although the full data set contains 359,688 observations, running the topic model is still feasible since the texts are all short.
2.3.1 Data Preprocessing
2.3.1.1 Data Cleaning
We bind the monthly separated data frames into one. Note that data_all consists of 6 lists of data frames, which is not appropriate for analyzing LDA on all the tweets across the months.
data_all_in <- rbind(oct, nov, dec, jan, feb, mar)
Taking only the Tweets from the data.
tweets_all <- data_all_in[, 'text']
# Remove unnecessary words/letters
cleaned_tweets_all <- clean(tweets_all)
data_all_in$cleaned_tweets_all <- cleaned_tweets_all
2.3.1.2 Data Preparation for LDA
We do the same preprocessing as in Section 2.2.1.
#Corpus
corp_tweets_all <- corpus(data_all_in, text_field = "cleaned_tweets_all")
#Lemmatization
lem_tweets_all <- lemmatize_strings(corp_tweets_all)
# Tokenization
toks_tweets_all <- quanteda::tokens(lem_tweets_all, what="word", remove_symbols=T)
toks_tweets_nostop_all <- tokens_remove(toks_tweets_all, c(stopwords("SMART"), "bc", "don", "isn", "aren", "de", "la"))
head(toks_tweets_nostop_all)
## Tokens consisting of 6 documents.
## text1 :
## [1] "spread" "infectious" "disease" "order" "vaccination"
## [6] "business" "expose" "swine" "flu" "expose"
## [11] "agendum"
##
## text2 :
## [1] "adjustment" "bureau" "love" "conquer" "hate"
## [6] "movieclips" "face" "powerful" "agent" "fate"
## [11] "david" "norris"
## [ ... and 6 more ]
##
## text3 :
## [1] "vaccine"
##
## text4 :
## [1] "sane" "good" "police" "port" "coquitlam"
## [6] "amp" "poison" "drug" "contaminate" "home"
## [11] "real" "trudeau"
## [ ... and 7 more ]
##
## text5 :
## [1] "put" "today" "time" "good"
##
## text6 :
## [1] "spread" "infectious" "disease" "order" "vaccination"
## [6] "business" "expose" "swine" "flu" "expose"
## [11] "agendum"
2.3.2 Frequent Words
We will only look at the frequent words in this part.
DT::datatable(word_fq(toks_tweets_nostop_all), class = "display", rownames=FALSE, options = list(pageLength = 5, autoWidth = FALSE))
2.3.3 LDA
dfm_all <- dfm(toks_tweets_nostop_all)
# Plot wordcloud to show most frequent words
textplot_wordcloud(dfm_all, max_words = 200, ordered_color = TRUE)
Similar words pop up in the word cloud, such as depopulation, amp, and vaccine, as before when we only accounted for the October and November data. But this time, considering all the data from October 2020 to March 2021, the most frequent term is unvaccinated. The increased discussion about vaccination makes sense: COVID-19 vaccines were being developed as of October 2020, and in the UK the first vaccine dose was delivered in December 2020.
dfmtrim_all <- dfm_all %>%
dfm_trim(min_docfreq = 0.01, max_docfreq = 0.8, docfreq_type = "prop")
set.seed(0)
tweets_lda_all <- textmodel_lda(dfmtrim_all, k = 10)
DT::datatable(as.data.frame(terms(tweets_lda_all)), class = "display", rownames=FALSE, options = list(pageLength = 10, autoWidth = FALSE))
- In Topic 6: Dr. Fauci is an American immunologist and “some falsely claimed that he was involved in creating the virus in a Chinese lab” (Wikipedia n.d).
lda_results_all <- as.data.frame(t(tweets_lda_all$phi))
2.3.3.1 Plotting
lda_results_all2 <- as_tibble(lda_results_all, rownames = NA) %>%
rownames_to_column(var = 'words')
top_results(lda_results_all2)
plt_fn(toptopic1, toptopic2, toptopic3, toptopic4, toptopic5, toptopic6, toptopic7, toptopic8, toptopic9, toptopic10)
When considering all the data from October 2020 to March 2021, we recognize that the term depopulation does not appear in each topic as much as before. Instead, vaccination-related terms appear more frequently (e.g., vaccinated, unvaccinated, antivaccine). Note again that vaccinated could mean not vaccinated in the original tweets before tokenization. The following words may represent each topic:
- Topic 1: Vaccinate, Unvaccinated, People, Death, Data(Datum), Israel
- Topic 2: Depopulation, Agendum, World, Reset, Amp
- Topic 3: Unvaccinated, Child, Back, Teacher, School, Die
- Topic 4: Work, Vaccine, Risk, Safe, High, Protect
- Topic 5: Amp, Vaccine, Freedom, Health
- Topic 6: Vaxxed, Dr. Fauci, Pain
- Topic 7: Vaccine, Vaccination, Mask, Make, State
- Topic 8: Vaccine, People, Covid, Kill, Test
- Topic 9: Unvaccinated, Racist, Prepare, People, Food
- Topic 10: Depopulation, Bill Gates, Antivaccine, Medical
Aligned with the results of the LDA on the two-month data sets, we again notice that all the topics seem to be linked to either conspiracy theories or people’s vaccine hesitancy. However, given some newly occurring words that were not present in the previous analysis, we will inspect which topics prevail in each month.
2.4 Topic per Month
To investigate which topics predominantly appear over the months, we utilize the theta values generated by the LDA algorithm. These values show how strongly each text is related to each of the 10 topics.
lda_theta <- as.data.frame(tweets_lda_all$theta)
monthlytopic <- cbind(data_all_in[, c('created_at', 'cleaned_tweets_all')], lda_theta)
monthlytopic$textID <- rownames(monthlytopic)
DT::datatable(head(monthlytopic), class = "display", rownames=TRUE, options = list(
pageLength = 5, autoWidth = FALSE))
monthlytopic_long <- pivot_longer(monthlytopic, cols = 3:12, names_to = "Topic", values_to = "Weights" )
top_topic_tweets <- monthlytopic_long %>%
group_by(textID) %>%
filter(Weights == max(Weights)) %>%
arrange(textID, Topic)
# Since textID is a string, text1*... comes before text2*.
date <- parse_date(top_topic_tweets$created_at, format = "%a %b %d %H:%M:%S %z %Y", na=c("", "NA"), locale=default_locale(), trim_ws = TRUE)
top_topic_tweets$date <- date
top_topic_tweets <- top_topic_tweets %>%
select(textID, Topic, Weights, date)
DT::datatable(head(top_topic_tweets), class = "display", rownames=T, options = list(pageLength = 5, autoWidth = FALSE))
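As the comment above notes, textID is a character vector, so it sorts lexicographically (“text10” before “text2”). If numeric document order ever matters, the index can be recovered from the ID, as in this base-R sketch:

```r
ids <- c("text1", "text10", "text2")
sort(ids)                                      # lexicographic: "text1" "text10" "text2"
# Strip the "text" prefix and order by the remaining integer:
ids[order(as.integer(sub("^text", "", ids)))]  # numeric: "text1" "text2" "text10"
```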
top_topic_tweets$month <- format(top_topic_tweets$date, "%b") #Taking Months only
prop = top_topic_tweets %>%
group_by(month, Topic) %>%
summarise(n=n()) %>%
mutate(Freq = n/sum(n)) # proportion of Topic K per month
prop$Topic <- factor(prop$Topic, levels = c("topic1", "topic2", "topic3", "topic4", "topic5", "topic6", "topic7", "topic8", "topic9","topic10"))
library(viridis)
p<-ggplot(prop, aes(x=month, y=Freq, fill=Topic)) +
geom_bar(stat = "identity", position = "stack")+
scale_fill_viridis(discrete = TRUE, name= "Topics") +
scale_x_discrete(limits = c("Oct", "Nov", "Dec", "Jan","Feb", "Mar" )) +
theme_minimal()
p
We see that Topic 2, which concerns the depopulation conspiracy theory with terms such as Depopulation, Agendum, World, Reset, and Amp, was discussed the most in October and then in November 2020. This goes in tandem with the introduction of the COVID-19 vaccines to the world. Similarly, Topic 10, described by the terms Depopulation, Bill Gates, Antivaccine, and Medical, was discussed more in 2020 than in 2021. Topic 3, related to the terms Unvaccinated, Child, Back, Teacher, School, and Die, is most prominent in March 2021 and then in February 2021. This may be because many schools around the world start the academic year, or a new semester after a break, in February and March. Topic 6, concerning terms such as vaxxed, leave, and Dr. Fauci, shows an ascending trend from October to December 2020 and again from January to March 2021, peaking in March 2021. Topic 9, associated with terms such as unvaccinated, racist, people, and food, appears most frequently in January 2021 and is much less present in other months.
Other than these trends, we do not see extreme variation across the months. To put it differently, all topics are discussed every month, and no topic dominates the distribution. This may be due to the similarity between the topics, as all of them are associated with conspiracy theories. Moreover, the fact that Muric, Wu, and Ferrara (2021) pre-selected tweets with anti-vaccination sentiment can also contribute to this similarity. Considering this, we judge that it is not meaningful to label each tweet with a topic and perform a supervised topic analysis. Instead, we will use the insights gathered from the Twitter data when assessing the characteristics of vaccine hesitancy with survey data.
3 Study 2: Survey study
To complement the lack of personal information in the Twitter data, in Study 2 we use survey data to explore, from a different perspective, the underlying reasons why people oppose vaccines.
3.1 Packages
# getwd()
# rm(list=ls())
####Load the packages####
#data manipulation
library("haven")
library("dplyr")
library('base')
#data visualization
library(knitr)
#factor analysis
library(car)
library(psych)
#decision tree
library("rpart")
library("rpart.plot")
library(vip)
#random forest
library(randomForest)
library(caret)
library(e1071)
3.2 Material and methods
3.2.1 Data collection and processing
The data used in Study 2 - ‘Living, Working and COVID-19 (round 3)’ - was collected by Eurofound from the EU-27 countries (Ahrendt et al. 2020). Access to the preliminary data set was granted to us on 23 May 2022 by Eurofound.
Participants of this survey were recruited via uncontrolled convenience sampling, specifically through advertising on social media and distribution to Eurofound’s stakeholders. The survey was administered online via the SoSciSurvey platform. The round 3 data collection began on 15 February 2021 and ended on 30 March 2021 in the EU-27 countries.
The data set we accessed had already been cleaned by Eurofound researchers to exclude partial interviews. Missing values (i.e., ‘Don’t know/Prefer not to answer,’ ‘Not applicable,’ and skipped questions) were coded as system missing.
In the current study, variables measured on Likert scales with more than 5 levels were treated as continuous variables (Knapp 1990). Listwise deletion was used to handle missing data.
3.2.2 Outcome measure: vaccine hesitancy
Vaccine hesitancy was assessed in the survey by asking participants “How likely or unlikely is it that you will take the COVID-19 vaccine when it becomes available to you?” Participants reported their intention on a 5-point Likert scale from 1 ‘very likely’ to 5 ‘very unlikely’.
Because we are interested in understanding why people are unwilling to receive the vaccine, those who answered 3 ‘neither likely nor unlikely’ were removed from our analysis for showing an ambiguous attitude. We then categorized respondents who answered ‘very likely’ or ‘rather likely’ as people without vaccine hesitancy, and respondents who answered ‘rather unlikely’ or ‘very unlikely’ as people with vaccine hesitancy.
In our final sample, 23.7% of respondents show vaccine hesitancy and 76.3% do not.
#load data
md <- read_sav("lwc_r1_r2_r3_public_09112021.sav")
##round 3 data with interested variables
mdv <- md %>%
filter(wave == 3,C314_01 != 3) %>%
dplyr::select(
#demographic
C008, #Urbanization level
B002, #gender
B003_01, #age
F344,#education
D001,#employment status
#well-being
C001_01, #life satisfaction
who5,
#resilience
C003_03,
C003_04,
#health
C004_01,
#trust
C007_01,
C007_02,
C007_03,
C007_04,
C007_05,
C307_06,
C307_07,
C307_08,
C312_01,
#covid-19 experience
C316_01,
C316_02,
C316_03,
#media
C319,
#economic
E006,
#decision
C314_01,
#optimistic
C003_01,
#country
B001,
#tense
C006_01
)
#rename column
colnames(mdv) <- c('UrbanisationLevel', 'Gender', 'Age', 'Education', 'EmploymentStatus',
'LifeSatisfaction','WHO5',
'Resilience1','Resilience2',
'GeneralHealth',
'TrustNewsMedia','TrustPolice','TrustGov','TrustEU','TrustHealthcare','TrustSocialMedia', 'TrustScience','TrustPharFirm','TrustPeople',
'GotCovid','CloseGotCovid','CloseDieCovid',
'NewsSource',
'HouseSaving','AntiVac',
'SelfOptimistic','Country','Tense'
)
#sapply(mdv,attr,'label')
#categorical: columns 1,2,4,5,20:25,27
#continuous: columns 3,6:19,26,28
#set type of variables
mdv[c(1,2,4,5,20:25,27)] <- lapply(mdv[c(1,2,4,5,20:25,27)],factor)
mdv[c(3,6:19,26,28)] <- lapply(mdv[c(3,6:19,26,28)],as.numeric)
mdv$AntiVac <- ifelse(mdv$AntiVac %in% c('1','2'),'support','anti')
mdv$AntiVac <- as.factor(mdv$AntiVac)
mdv<-as.data.frame(na.omit(mdv))
table(mdv$AntiVac)
##
## anti support
## 7297 23473
# mdv data frame
DT::datatable(mdv, class = "display", rownames=FALSE, options = list(
pageLength = 5, autoWidth = FALSE))
3.2.3 Predictors measure
Several variables of interest in the data set were selected to build a model that could predict an individual’s vaccine hesitancy and to explore their importance in contributing to vaccine hesitancy. We mainly focus on variables related to trust, well-being, COVID-19 experience, and socioeconomic and demographic characteristics.
3.2.3.1 Predictor variables: trust
Trust has been shown to correlate with COVID-19 vaccine hesitancy in many studies (e.g., Schernhammer et al. 2022; Mewhirter, Sagir, and Sanders 2022). One possible explanation is that the pandemic generated widespread disinformation that has undermined the acceptance of, and trust in, science and policy, which extends to the issue of vaccine hesitancy (Rawlings, Looi, and Robson n.d.).
According to the preliminary results of a factor analysis, we divided trust into four factors: (1) trust in institutions: trust in police, trust in the healthcare system, trust in government, trust in news media; (2) trust in science and pharmaceutical firms; (3) interpersonal trust: trust in people, trust in social media; (4) trust in the European Union.
KMO(r = cor(mdv[11:19]))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(mdv[11:19]))
## Overall MSA = 0.9
## MSA for each item =
## TrustNewsMedia TrustPolice TrustGov TrustEU
## 0.92 0.89 0.90 0.89
## TrustHealthcare TrustSocialMedia TrustScience TrustPharFirm
## 0.89 0.89 0.89 0.91
## TrustPeople
## 0.94
t(cortest.bartlett(mdv[11:19]))
## chisq p.value df
## [1,] 117948.6 0 36
# KMO and Bartlett's test results indicate FA will be useful
parallel <- fa.parallel(mdv[11:19])
## Parallel analysis suggests that the number of factors = 4 and the number of components = 1
#parallel analysis suggests that the number of factors = 4
fa.none <- fa(r=mdv[c(11:19)],
nfactors = 4,
# covar = FALSE, SMC = TRUE,
fm='pa', # type of factor analysis we want to use (“pa” is principal axis factoring)
max.iter=300,
rotate='varimax') # varimax rotation
fa.diagram(fa.none)
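The four rotated factors could also be turned into composite scores for later use. A minimal sketch, assuming the `fa.none` and `mdv` objects from the chunk above; the ordering of PA1-PA4 and the composite names are assumptions that should be verified against the loadings before use:

```r
# attach the four factor scores (regression scores, the psych default) to the data
# NOTE: check fa.none$loadings first - the order of PA1..PA4 is not guaranteed
# to match the substantive labels used here
trust.scores <- as.data.frame(fa.none$scores)
colnames(trust.scores) <- c('TrustInstitutions', 'TrustSciPharm',
                            'TrustInterpersonal', 'TrustEUFactor')
mdv.f <- cbind(mdv, trust.scores)
```

These composite scores could replace the nine raw trust items as predictors, at the cost of less direct interpretability of the tree splits; in the analyses below we keep the raw items.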
3.2.3.2 Predictor variables: well-being
Well-being variables were considered because previous findings indicated that it is anxiety and fear, rather than hope, that drive people to take the vaccine (Bullock, Lane, and Shults 2022). Therefore, in the current study, we take optimism, feeling tense, general life satisfaction, general subjective well-being, and general health (general physical well-being) into account, to investigate whether they affect an individual’s vaccine hesitancy.
3.2.3.3 Predictor variables: COVID-19 experience
A person’s experience with COVID-19 may influence their willingness to be vaccinated. For example, the death of someone around us from COVID-19 may make us feel the danger of the disease, and thus more willing to get vaccinated. In the current study, we included three variables related to COVID-19 experience. These variables assessed whether (1) one has tested positive for COVID-19, (2) someone close to one has tested positive for COVID-19, and (3) someone close to one has died of COVID-19.
3.2.3.4 Predictor variables: socioeconomic and demography
Based on previous findings, we consider a number of socioeconomic and demographic variables that are potentially associated with vaccine hesitancy (Lee and Huang 2022; Mewhirter, Sagir, and Sanders 2022). We included 2 socioeconomic variables (i.e., variables that reflect an individual’s status within a community): employment status and household savings. For demographic variables, we included 5: urbanization level, education level, country, gender, and age.
3.2.3.5 Predictor variables: news source, self-resilience
Finally, we considered news source and resilience as predictors of vaccine hesitancy. Online misinformation about the COVID-19 vaccine has been found to be linked to COVID-19 vaccination hesitancy and refusal (Pierri et al. 2022). Compared to television programs, “news” published on social media is more likely to contain misinformation. For example, in Study 1, we found that anti-vaxxers often discuss topics related to conspiracy theories, and they are highly likely to encounter such misleading or unscientific information on social media (e.g., Twitter, Facebook) rather than on TV programs. Previous research has found that vaccine hesitancy was higher among those with high self-resilience than among those with low self-resilience (Schernhammer et al. 2022). Therefore, self-resilience was also considered in this study. Self-resilience was assessed by two questions: (1) I find it difficult to deal with important problems that come up in my life, and (2) When things go wrong in my life, it generally takes me a long time to get back to normal.
3.3 Statistical models
Reference: some code used in the decision tree model generation, in the factor analysis of the trust-related variables, and in the random forest model generation is adapted from external sources.
Two supervised machine learning methods, decision trees and random forests, are used to investigate our research question. These two methods were chosen because they complement each other. A decision tree is a simple and effective decision-making diagram that can handle mixed data (i.e., categorical and continuous variables). A random forest is an extension of bagging, and both improve on single decision trees. Although decision trees are easy to interpret and visualize, they may over-fit the data or yield a model that is not accurate enough. Random forests compensate for these issues by yielding more accurate and precise results and by preventing over-fitting, as they aggregate multiple trees grown on bootstrap samples. However, they are harder to interpret than decision trees, their results are more complex to visualize, and they require more computational power.
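The variance-reduction idea behind bagging can be illustrated with a toy base-R sketch. This uses synthetic data, not our survey; the stump rule and all names here are made up for illustration:

```r
set.seed(1)
# toy data: a noisy binary outcome driven by a single predictor
n <- 200
x <- runif(n)
y <- factor(ifelse(x + rnorm(n, sd = 0.3) > 0.5, "anti", "support"))

# a "stump" trained on a (bootstrap) sample: split at the sample median of x
# and predict the majority class on each side of the split
stump_predict <- function(idx, newx) {
  cut   <- median(x[idx])
  upper <- names(which.max(table(y[idx][x[idx] > cut])))
  ifelse(newx > cut, upper, setdiff(levels(y), upper))
}

# a single stump vs. a majority vote over 25 bootstrap stumps
single <- stump_predict(seq_len(n), x)
votes  <- replicate(25, stump_predict(sample(n, replace = TRUE), x))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
c(single = mean(single == y), bagged = mean(bagged == y))
```

Random forests add a further layer of de-correlation on top of this voting scheme: each split considers only a random subset of the predictors (mtry).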
3.3.1 Train and test set preparation
In total, 30,770 respondents entered the final analysis. We split the data into train and test sets in a 75:25 ratio: 75% of the data serves to train the decision tree and random forest models, while the remaining 25% is used to make predictions and evaluate model accuracy.
set.seed(377)
smp_size <- floor(0.75 * nrow(mdv))
train_ind <- sample(seq_len(nrow(mdv)), size = smp_size)
train <- mdv[train_ind, ]
test <- mdv[-train_ind, ]
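With a roughly 76:24 class split, a simple random split can slightly distort the anti/support ratio between the two sets. As an alternative, a stratified split could be drawn with caret (already loaded above); a sketch, with `train_s`/`test_s` as illustrative names:

```r
# stratified 75:25 split preserving the AntiVac class proportions
set.seed(377)
in_train <- caret::createDataPartition(mdv$AntiVac, p = 0.75, list = FALSE)
train_s  <- mdv[in_train, ]
test_s   <- mdv[-in_train, ]
prop.table(table(train_s$AntiVac))  # should closely match the full-sample split
```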
3.3.2 Decision trees
3.3.2.1 Modeling and pruning - with all variables
We first created a base model with the complexity parameter set to zero (cp = 0), which allows the decision tree to grow fully. Then, we pruned the model following Breiman’s 1-SE rule (i.e., in practice, choose the simplest tree within one standard error of the best tree) (Breiman et al. 1984). According to the output, the best tree is the one with the lowest cross-validated relative error, xerror = 0.67244. Based on the 1-SE rule, we chose the tree in row 8 (xerror = 0.67919), whose xerror is below 0.67244 + 0.010151 = 0.682591.
#base model
#In R, we use the 'rpart' package for "Recursive Partitioning and Regression Trees". Because we want to predict whether a respondent is 'anti' or 'support' vaccination, we chose method = "class".
set.seed(377)
tree.base <- rpart(AntiVac~.,train,method = 'class',control = rpart.control(cp=0))
printcp(tree.base)
##
## Classification tree:
## rpart(formula = AntiVac ~ ., data = train, method = "class",
## control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] Age CloseDieCovid CloseGotCovid Country
## [5] Education EmploymentStatus Gender GeneralHealth
## [9] GotCovid HouseSaving LifeSatisfaction NewsSource
## [13] Resilience1 Resilience2 SelfOptimistic Tense
## [17] TrustEU TrustGov TrustHealthcare TrustNewsMedia
## [21] TrustPeople TrustPharFirm TrustPolice TrustScience
## [25] TrustSocialMedia UrbanisationLevel WHO5
##
## Root node error: 5483/23077 = 0.2376
##
## n= 23077
##
## CP nsplit rel error xerror xstd
## 1 2.0135e-01 0 1.00000 1.00000 0.011792
## 2 5.9821e-02 1 0.79865 0.79519 0.010846
## 3 1.6962e-02 2 0.73883 0.74047 0.010549
## 4 9.1191e-03 3 0.72187 0.72862 0.010482
## 5 8.9367e-03 6 0.69451 0.71330 0.010394
## 6 6.2010e-03 8 0.67664 0.69177 0.010268
## 7 3.5564e-03 9 0.67044 0.68393 0.010221
## 8 3.1005e-03 11 0.66332 0.67919 0.010192
## 9 2.5533e-03 14 0.65402 0.67955 0.010194
## 10 2.4926e-03 15 0.65147 0.67919 0.010192
## 11 2.4622e-03 18 0.64399 0.67846 0.010188
## 12 2.3710e-03 20 0.63907 0.67810 0.010186
## 13 1.8238e-03 24 0.62958 0.67427 0.010162
## 14 1.4591e-03 25 0.62776 0.67390 0.010160
## 15 1.3679e-03 26 0.62630 0.67317 0.010156
## 16 1.2767e-03 28 0.62356 0.67244 0.010151
## 17 1.2159e-03 29 0.62229 0.67408 0.010161
## 18 1.0943e-03 40 0.60806 0.67682 0.010178
## 19 1.0031e-03 43 0.60478 0.67572 0.010171
## 20 9.7270e-04 52 0.59548 0.67627 0.010175
## 21 9.1191e-04 64 0.58253 0.67572 0.010171
## 22 8.5112e-04 74 0.57268 0.68047 0.010200
## 23 8.2072e-04 81 0.56575 0.68247 0.010212
## 24 7.2953e-04 90 0.55772 0.67864 0.010189
## 25 6.6873e-04 114 0.54003 0.67500 0.010167
## 26 6.3834e-04 119 0.53639 0.67919 0.010192
## 27 5.9274e-04 122 0.53438 0.67919 0.010192
## 28 5.4715e-04 136 0.52362 0.67755 0.010182
## 29 4.9741e-04 174 0.50137 0.68047 0.010200
## 30 4.8635e-04 191 0.49024 0.68503 0.010228
## 31 4.7419e-04 199 0.48587 0.68503 0.010228
## 32 4.5595e-04 206 0.48222 0.68667 0.010237
## 33 4.2556e-04 229 0.47073 0.68922 0.010253
## 34 4.1036e-04 232 0.46945 0.68922 0.010253
## 35 3.6476e-04 240 0.46617 0.70326 0.010336
## 36 3.1917e-04 274 0.45358 0.70728 0.010359
## 37 3.0397e-04 278 0.45231 0.71895 0.010427
## 38 2.7357e-04 306 0.44264 0.72442 0.010459
## 39 2.4318e-04 314 0.44045 0.72697 0.010473
## 40 2.3449e-04 320 0.43899 0.72752 0.010476
## 41 2.2798e-04 328 0.43699 0.72825 0.010480
## 42 1.8238e-04 346 0.43206 0.72825 0.010480
## 43 1.5198e-04 400 0.42203 0.74175 0.010557
## 44 1.4591e-04 412 0.42021 0.74430 0.010571
## 45 1.3679e-04 417 0.41948 0.74485 0.010574
## 46 1.2159e-04 430 0.41729 0.74558 0.010578
## 47 9.1191e-05 450 0.41455 0.74649 0.010583
## 48 8.1059e-05 469 0.41273 0.75616 0.010636
## 49 6.8393e-05 478 0.41200 0.75798 0.010646
## 50 6.0794e-05 486 0.41145 0.76254 0.010671
## 51 4.5595e-05 492 0.41109 0.76491 0.010684
## 52 3.6476e-05 500 0.41072 0.76874 0.010705
## 53 0.0000e+00 505 0.41054 0.77166 0.010721
# row 16 has the minimum xerror: 0.67244 (xstd = 0.010151)
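The 1-SE selection done by eye above can also be computed directly from the cptable (a sketch, assuming the `tree.base` object from the chunk above):

```r
# 1-SE rule: pick the simplest tree whose xerror is within one SE of the minimum
cp_tab  <- tree.base$cptable
best    <- which.min(cp_tab[, "xerror"])
thresh  <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
row_1se <- min(which(cp_tab[, "xerror"] <= thresh))
cp_1se  <- cp_tab[row_1se, "CP"]
cp_1se  # cp value to pass to rpart.control() or prune()
```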
Hence, we pruned the tree with cp = 3.1005e-03 from row 8 and compared the accuracy of the base and pruned models. According to the output, the pruned model has better accuracy (83.5%), which indicates that it generalizes better to unseen data. Therefore, we selected the pruned model as our final decision tree model.
#pruned model
set.seed(377)
tree.prune <- rpart(AntiVac~.,train,method = 'class',control = rpart.control(cp=3.1005e-03))
#accuracy of base model
pred.base <- predict(tree.base, test,type='class')
base.mat <- table(test$AntiVac, pred.base)
base.accuracy <- sum(diag(base.mat)) / sum(base.mat)
#accuracy of pruned model
pred.prune <- predict(tree.prune, test,type='class')
prune.mat <- table(test$AntiVac, pred.prune)
prune.accuracy <- sum(diag(prune.mat)) / sum(prune.mat)
data.frame(base.accuracy,prune.accuracy)
## base.accuracy prune.accuracy
## 1 0.8177564 0.8350448
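Because the classes are imbalanced (roughly 76:24), accuracy alone can mask poor performance on the minority ‘anti’ class. Class-wise metrics could be checked with caret (already loaded above); a sketch using the `pred.prune` and `test` objects defined above:

```r
# sensitivity/specificity of the pruned tree, treating 'anti' as the positive class
caret::confusionMatrix(pred.prune, test$AntiVac, positive = "anti")
```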
The following shows the rules and the top 20 features of the final decision tree model.
Starting from the root node at the top of the decision tree plot, we can see that 76% of the respondents in our final data do not have vaccine hesitancy, while 24% do. Travelling down the branches, we see, for example, that among respondents whose trust in pharmaceutical firms is higher than 3 out of 10, 87% are likely to take the vaccine, and so forth.
As shown in the feature importance plot, the trust variables and country play a more important role in predicting vaccine hesitancy than the well-being, COVID-19 experience, and other socioeconomic and demographic variables. Among the trust variables, trust in pharmaceutical firms and trust in science are the two most important features, followed by trust in government institutions and trust in the EU. Interpersonal trust, by contrast, appears much less important.
These results are in line with the findings of Mewhirter, Sagir, and Sanders (2022), who also found that trust and some demographic variables (e.g., community) have a strong association with vaccine hesitancy. However, they argued that researchers should look for deeper sources instead of relying on demographic patterns when studying public health: it could be that people who distrust vaccines cluster in certain areas, thus shaping the demographic pattern. Considering this, we built a model with only trust in pharmaceutical firms and science, trust in government institutions, and trust in the EU, and checked its accuracy.
rpart.plot(tree.prune)
vip(tree.prune, num_features = 20, bar = F)
3.3.2.2 Modeling and pruning - with trust variables
The same modeling and pruning process was applied to a model with only trust variables. The results show that the model with only trust variables (i.e., trust in pharmaceutical firms and science, trust in government institutions, and trust in the EU) has accuracy (82.85%) similar to the model with all variables (83.5%).
#base model - trust
set.seed(377)
tree.base.t <- rpart(AntiVac~ `TrustNewsMedia` + `TrustPolice` + `TrustGov` + `TrustEU` + `TrustHealthcare` + `TrustScience` + `TrustPharFirm`,train,method = 'class',control = rpart.control(cp=0))
printcp(tree.base.t)
##
## Classification tree:
## rpart(formula = AntiVac ~ TrustNewsMedia + TrustPolice + TrustGov +
## TrustEU + TrustHealthcare + TrustScience + TrustPharFirm,
## data = train, method = "class", control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] TrustEU TrustGov TrustHealthcare TrustNewsMedia
## [5] TrustPharFirm TrustPolice TrustScience
##
## Root node error: 5483/23077 = 0.2376
##
## n= 23077
##
## CP nsplit rel error xerror xstd
## 1 2.0135e-01 0 1.00000 1.00000 0.011792
## 2 5.9821e-02 1 0.79865 0.79519 0.010846
## 3 6.2010e-03 2 0.73883 0.74047 0.010549
## 4 3.3741e-03 4 0.72643 0.73208 0.010502
## 5 2.5533e-03 6 0.71968 0.73190 0.010501
## 6 2.4622e-03 7 0.71713 0.73226 0.010503
## 7 2.0062e-03 9 0.71220 0.73190 0.010501
## 8 1.7326e-03 15 0.69980 0.72989 0.010490
## 9 1.6414e-03 17 0.69633 0.72570 0.010466
## 10 1.2767e-03 20 0.69141 0.72241 0.010447
## 11 1.0943e-03 22 0.68886 0.72023 0.010434
## 12 9.1191e-04 23 0.68776 0.72169 0.010443
## 13 8.0248e-04 25 0.68594 0.72624 0.010469
## 14 7.2953e-04 33 0.67901 0.72570 0.010466
## 15 6.6873e-04 36 0.67682 0.72679 0.010472
## 16 6.0794e-04 39 0.67481 0.72734 0.010475
## 17 5.4715e-04 42 0.67299 0.72552 0.010465
## 18 4.9504e-04 57 0.66478 0.72460 0.010460
## 19 4.8635e-04 65 0.66077 0.72442 0.010459
## 20 4.5595e-04 78 0.65402 0.72497 0.010462
## 21 4.2556e-04 86 0.65037 0.72424 0.010457
## 22 3.6476e-04 96 0.64600 0.72588 0.010467
## 23 3.2829e-04 139 0.63013 0.72770 0.010477
## 24 3.1265e-04 145 0.62812 0.72898 0.010485
## 25 2.7357e-04 153 0.62539 0.73062 0.010494
## 26 2.5533e-04 169 0.62101 0.73555 0.010522
## 27 2.4318e-04 174 0.61973 0.73828 0.010537
## 28 2.2798e-04 195 0.61408 0.74011 0.010547
## 29 2.1278e-04 203 0.61226 0.74065 0.010550
## 30 1.8238e-04 243 0.60204 0.75251 0.010616
## 31 1.4591e-04 294 0.59201 0.75579 0.010634
## 32 1.2159e-04 315 0.58855 0.75999 0.010657
## 33 1.0943e-04 330 0.58581 0.76181 0.010667
## 34 1.0422e-04 335 0.58526 0.76181 0.010667
## 35 9.1191e-05 342 0.58453 0.77020 0.010713
## 36 6.0794e-05 362 0.58271 0.77366 0.010731
## 37 5.2109e-05 392 0.58089 0.77713 0.010750
## 38 4.5595e-05 402 0.58034 0.78242 0.010778
## 39 3.6476e-05 410 0.57997 0.78260 0.010779
## 40 1.6580e-05 424 0.57924 0.78862 0.010811
## 41 0.0000e+00 435 0.57906 0.78953 0.010816
#pruned model
set.seed(377)
tree.prune.t <- rpart(AntiVac~ `TrustNewsMedia` + `TrustPolice` + `TrustGov` + `TrustEU` + `TrustHealthcare` + `TrustScience` + `TrustPharFirm`,train,method = 'class',control = rpart.control(cp=2.5533e-03))
#accuracy of base model
pred.base.t <- predict(tree.base.t, test,type='class')
base.mat.t <- table(test$AntiVac, pred.base.t)
base.accuracy.t <- sum(diag(base.mat.t)) / sum(base.mat.t)
#accuracy of pruned model
pred.prune.t <- predict(tree.prune.t, test,type='class')
prune.mat.t <- table(test$AntiVac, pred.prune.t)
prune.accuracy.t <- sum(diag(prune.mat.t)) / sum(prune.mat.t)
data.frame(base.accuracy.t,prune.accuracy.t,prune.accuracy)
## base.accuracy.t prune.accuracy.t prune.accuracy
## 1 0.8151566 0.8285454 0.8350448
As with the model with all variables, we plot the rules and feature importance of this model.
rpart.plot(tree.prune.t)
vip(tree.prune.t, num_features = 20, bar = F)
We also fitted a model with only trust in pharmaceutical firms and trust in science as predictors; its accuracy (82.4%) is very close to that of the previous model with more trust predictors (82.85%).
#base model - with only 2 trust variables
set.seed(377)
tree.base.o <- rpart(AntiVac~`TrustPharFirm` + `TrustScience`,train,method = 'class',control = rpart.control(cp=0))
printcp(tree.base.o)
##
## Classification tree:
## rpart(formula = AntiVac ~ TrustPharFirm + TrustScience, data = train,
## method = "class", control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] TrustPharFirm TrustScience
##
## Root node error: 5483/23077 = 0.2376
##
## n= 23077
##
## CP nsplit rel error xerror xstd
## 1 2.0135e-01 0 1.00000 1.00000 0.011792
## 2 3.1917e-02 1 0.79865 0.79847 0.010863
## 3 1.6232e-02 2 0.76673 0.76673 0.010694
## 4 8.9367e-03 3 0.75050 0.75397 0.010624
## 5 3.9082e-04 4 0.74156 0.74631 0.010582
## 6 4.5595e-05 11 0.73883 0.74576 0.010579
## 7 0.0000e+00 15 0.73865 0.74667 0.010584
# 1-SE threshold: 0.74576 + 0.010579 = 0.756339; simplest tree within it: row 4, cp = 8.9367e-03
#pruned model - with only 2 trust variables
set.seed(377)
tree.prune.o <- rpart(AntiVac~`TrustPharFirm` + `TrustScience`,train,method = 'class',control = rpart.control(cp=8.9367e-03, minbucket = 5))
#accuracy of base model
pred.base.o <- predict(tree.base.o, test,type='class')
base.mat.o <- table(test$AntiVac, pred.base.o)
base.accuracy.o <- sum(diag(base.mat.o)) / sum(base.mat.o)
#accuracy of pruned model
pred.prune.o <- predict(tree.prune.o, test,type='class')
prune.mat.o <- table(test$AntiVac, pred.prune.o)
prune.accuracy.o <- sum(diag(prune.mat.o)) / sum(prune.mat.o)
data.frame(base.accuracy.o,prune.accuracy.o,prune.accuracy.t, prune.accuracy)
## base.accuracy.o prune.accuracy.o prune.accuracy.t prune.accuracy
## 1 0.8229559 0.8239958 0.8285454 0.8350448
3.3.3 Random Forests
3.3.3.1 Modeling and tuning - with all variables
We first ran a default random forest model and then tuned it to find a better solution.
The output of the default random forest model shows that 500 trees were grown, with 5 variables tried at each split (mtry = 5). The out-of-bag (OOB) error (i.e., the prediction error on the out-of-bag observations) is 14.46%. The do.trace argument lets us follow the change in the OOB error as the number of trees grows. As seen in the OOB plot, the OOB error has stabilized by ntree = 500.
set.seed(377)
# Run the model
model.rf <- randomForest(`AntiVac`~., data = train, do.trace = T)
## ntree OOB 1 2
## 1: 22.88% 50.27% 14.26%
## 2: 23.01% 47.72% 15.23%
## 3: 22.99% 46.67% 15.58%
## 4: 22.13% 45.22% 14.97%
## 5: 21.75% 45.59% 14.38%
## 6: 21.19% 45.29% 13.71%
## 7: 20.64% 45.64% 12.85%
## 8: 20.24% 45.50% 12.36%
## 9: 19.60% 45.02% 11.68%
## 10: 19.35% 45.31% 11.26%
## 11: 19.03% 45.37% 10.82%
## 12: 18.30% 45.00% 9.98%
## (iterations 13-408 omitted: the OOB error declines steadily from about 18% and levels off around 14.4-14.6% after roughly 150 trees)
## 409: 14.48% 45.92% 4.68%
## 410: 14.49% 46.01% 4.67%
## 411: 14.49% 46.09% 4.65%
## 412: 14.50% 46.03% 4.68%
## 413: 14.45% 45.96% 4.63%
## 414: 14.46% 45.96% 4.65%
## 415: 14.47% 46.05% 4.63%
## 416: 14.45% 46.01% 4.62%
## 417: 14.45% 46.01% 4.62%
## 418: 14.44% 45.98% 4.62%
## 419: 14.47% 46.03% 4.63%
## 420: 14.44% 45.92% 4.63%
## 421: 14.47% 46.00% 4.64%
## 422: 14.44% 45.96% 4.62%
## 423: 14.46% 45.91% 4.65%
## 424: 14.44% 45.91% 4.63%
## 425: 14.43% 45.85% 4.64%
## 426: 14.47% 45.96% 4.66%
## 427: 14.45% 45.89% 4.65%
## 428: 14.46% 45.92% 4.65%
## 429: 14.46% 46.01% 4.63%
## 430: 14.44% 45.89% 4.64%
## 431: 14.43% 45.81% 4.64%
## 432: 14.45% 45.89% 4.65%
## 433: 14.44% 45.85% 4.65%
## 434: 14.43% 45.85% 4.64%
## 435: 14.42% 45.87% 4.62%
## 436: 14.42% 45.83% 4.63%
## 437: 14.41% 45.87% 4.61%
## 438: 14.40% 45.80% 4.62%
## 439: 14.39% 45.76% 4.62%
## 440: 14.40% 45.85% 4.60%
## 441: 14.42% 45.81% 4.63%
## 442: 14.45% 45.87% 4.66%
## 443: 14.41% 45.74% 4.65%
## 444: 14.43% 45.81% 4.65%
## 445: 14.41% 45.72% 4.65%
## 446: 14.43% 45.78% 4.67%
## 447: 14.44% 45.76% 4.68%
## 448: 14.44% 45.85% 4.65%
## 449: 14.45% 45.91% 4.65%
## 450: 14.46% 45.94% 4.65%
## 451: 14.48% 45.98% 4.67%
## 452: 14.46% 45.94% 4.64%
## 453: 14.48% 45.98% 4.67%
## 454: 14.49% 45.98% 4.67%
## 455: 14.47% 45.96% 4.66%
## 456: 14.46% 45.91% 4.67%
## 457: 14.49% 45.98% 4.67%
## 458: 14.46% 45.87% 4.67%
## 459: 14.48% 45.96% 4.67%
## 460: 14.45% 45.87% 4.66%
## 461: 14.48% 45.96% 4.67%
## 462: 14.46% 45.91% 4.67%
## 463: 14.44% 45.89% 4.64%
## 464: 14.46% 45.89% 4.67%
## 465: 14.44% 45.92% 4.63%
## 466: 14.43% 45.85% 4.63%
## 467: 14.46% 45.89% 4.67%
## 468: 14.45% 45.83% 4.67%
## 469: 14.41% 45.78% 4.63%
## 470: 14.43% 45.92% 4.62%
## 471: 14.46% 45.98% 4.64%
## 472: 14.45% 45.96% 4.63%
## 473: 14.46% 45.98% 4.64%
## 474: 14.45% 45.96% 4.63%
## 475: 14.44% 45.91% 4.63%
## 476: 14.43% 45.89% 4.62%
## 477: 14.47% 46.03% 4.64%
## 478: 14.47% 46.05% 4.63%
## 479: 14.44% 46.01% 4.60%
## 480: 14.44% 46.00% 4.61%
## 481: 14.45% 46.03% 4.60%
## 482: 14.46% 46.07% 4.61%
## 483: 14.46% 46.07% 4.61%
## 484: 14.45% 46.03% 4.61%
## 485: 14.43% 45.96% 4.60%
## 486: 14.41% 45.98% 4.58%
## 487: 14.40% 45.94% 4.57%
## 488: 14.40% 45.98% 4.56%
## 489: 14.40% 45.94% 4.56%
## 490: 14.40% 45.96% 4.56%
## 491: 14.38% 45.87% 4.57%
## 492: 14.40% 45.92% 4.57%
## 493: 14.41% 46.01% 4.56%
## 494: 14.40% 45.94% 4.56%
## 495: 14.40% 45.94% 4.56%
## 496: 14.40% 45.98% 4.55%
## 497: 14.38% 45.89% 4.56%
## 498: 14.40% 45.92% 4.58%
## 499: 14.38% 45.91% 4.55%
## 500: 14.46% 46.14% 4.59%
model.rf
##
## Call:
## randomForest(formula = AntiVac ~ ., data = train, do.trace = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 14.46%
## Confusion matrix:
## anti support class.error
## anti 2953 2530 0.46142623
## support 807 16787 0.04586791
### Step 2: tune the model
# Find the best number of ntree
oob.error.data <- data.frame(
  Trees = rep(1:nrow(model.rf$err.rate), times = 3),
  Type  = rep(c('OOB', 'support', 'anti'), each = nrow(model.rf$err.rate)),
  Error = c(model.rf$err.rate[, 'OOB'],
            model.rf$err.rate[, 'support'],
            model.rf$err.rate[, 'anti']))
ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
  geom_line(aes(color = Type))
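The stabilisation point visible in the plot can also be checked numerically; a minimal sketch (not part of the original analysis), assuming `model.rf` from above is available:

```r
# Numeric complement to the OOB-error plot: locate the tree count with
# the lowest OOB error and report the error value there.
best.ntree <- which.min(model.rf$err.rate[, 'OOB'])
best.ntree
model.rf$err.rate[best.ntree, 'OOB']
```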
Then, we examined several possible mtry values. We tested the OOB error for mtry values from 1 to 10; the results showed that the OOB error is lowest when mtry = 10. In the current study, we did not tune maxnodes but allowed the trees to grow to their maximum possible size.
# Find the best number of mtry
oob.values <- vector(length = 10)
for (i in 1:10) {
  temp.model <- randomForest(AntiVac ~ ., data = train, ntree = 500, mtry = i)
  # keep the final (500-tree) OOB error rate for this mtry
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 'OOB']
}
order(oob.values)  # mtry values sorted by OOB error; mtry = 10 is lowest
## [1] 10 6 8 5 9 4 7 3 2 1
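As an alternative to the manual loop above, the `randomForest` package provides `tuneRF()`, which searches mtry automatically; a hedged sketch (shown as an option, not part of the original analysis):

```r
# tuneRF expects predictors and outcome separately; starting from the
# default mtry, it multiplies/divides by stepFactor and keeps searching
# while the OOB error improves by at least `improve`.
set.seed(377)
tuned <- tuneRF(x = train[, setdiff(names(train), "AntiVac")],
                y = train$AntiVac,
                ntreeTry   = 500,
                stepFactor = 1.5,
                improve    = 0.01,
                trace      = TRUE)
```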
Lastly, we set mtry to 10 (up from the default of 5) and reran the model with the tuned value. The tuned model has an OOB accuracy of 85.74%, higher than that of the default model (85.54%).
model.rf.2 <- randomForest(AntiVac~., data = train, ntree = 500, mtry = 10)
model.rf.2
##
## Call:
## randomForest(formula = AntiVac ~ ., data = train, ntree = 500, mtry = 10)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 14.26%
## Confusion matrix:
## anti support class.error
## anti 3110 2373 0.43279227
## support 917 16677 0.05212004
As with the decision tree model, we also checked the confusion matrix and the accuracy of the random forest model on the test set. The results showed that the accuracy of the random forest model (0.8594, 95% CI [0.8514, 0.8670]) is better than that of the decision tree model (0.835), though not by much.
rf.pred <- predict(model.rf.2, test)
rf.mat <- confusionMatrix(rf.pred, test$AntiVac)
rf.mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction anti support
## anti 1024 292
## support 790 5587
##
## Accuracy : 0.8594
## 95% CI : (0.8514, 0.867)
## No Information Rate : 0.7642
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5688
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5645
## Specificity : 0.9503
## Pos Pred Value : 0.7781
## Neg Pred Value : 0.8761
## Prevalence : 0.2358
## Detection Rate : 0.1331
## Detection Prevalence : 0.1711
## Balanced Accuracy : 0.7574
##
## 'Positive' Class : anti
##
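The headline statistics above can be recovered directly from the confusion matrix counts, with anti as the positive class; a quick sanity check using the counts from this run:

```r
# Counts from the confusion matrix above (rows = prediction, cols = reference)
tp <- 1024; fp <- 292    # predicted anti:    reference anti / support
fn <- 790;  tn <- 5587   # predicted support: reference anti / support

tp / (tp + fn)                      # sensitivity, approx. 0.5645
tn / (tn + fp)                      # specificity, approx. 0.9503
(tp + tn) / (tp + fp + fn + tn)     # accuracy,    approx. 0.8594
```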
Here we visualize the most important variables for classifying attitudes toward vaccination. Mean decrease in Gini measures how much a variable contributes to predicting the target variable, averaged across all the trees that make up the forest. Hence, the higher a variable's mean decrease in Gini, the more important that variable is in the model.
The results from the random forest are similar to those from the decision trees: trust in pharmaceutical firms plays the most important role in explaining vaccine hesitancy, followed by country and trust in science. As the plot below shows, however, the demographic variables play a more important role here than they did in the decision tree.
varImpPlot(model.rf.2, n.var = 20)
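The same information can be inspected in tabular form; a small sketch (not in the original analysis), assuming `model.rf.2` from above:

```r
# Extract the importance matrix and rank predictors by MeanDecreaseGini,
# from most to least important.
imp <- importance(model.rf.2)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```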
3.3.3.2 Modeling and tuning - with trust variables
As in the decision tree models, we include only trust in pharmaceutical firms and trust in science as predictors to see how the model's accuracy changes. The accuracy decreases by 3.55 percentage points.
set.seed(377)
# Run the model
model.rf.t <- randomForest(AntiVac ~ TrustPharFirm + TrustScience, data = train, do.trace = T)
## ntree      OOB      1      2
## (per-tree do.trace output for trees 1-500 truncated: OOB error settles near 17.7%, class errors near 53.5% (anti) and 6.6% (support))
model.rf.t
##
## Call:
## randomForest(formula = AntiVac ~ TrustPharFirm + TrustScience, data = train, do.trace = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 17.74%
## Confusion matrix:
## anti support class.error
## anti 2549 2934 0.53510852
## support 1161 16433 0.06598841
# compare the accuracy with the model that used all variables
rf.pred.t <- predict(model.rf.t, test)
rf.mat.t <- confusionMatrix(rf.pred.t, test$AntiVac)
data.frame(rf.mat$overall[1],rf.mat.t$overall[1])
## rf.mat.overall.1. rf.mat.t.overall.1.
## Accuracy 0.8593527 0.8238659
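The gap between the two models can be read off directly from the accuracies reported above:

```r
# Accuracy of the full model minus that of the two-trust-variable model,
# using the values printed above.
round(0.8593527 - 0.8238659, 4)   # 0.0355, about 3.55 percentage points
```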
3.4 Conclusion
The following shows the accuracy of each model we have built.
ms <- data.frame(prune.accuracy, prune.accuracy.t, prune.accuracy.o, rf.mat.t$overall[1], rf.mat$overall[1])
ms <- ms %>%
rename(
DecisionTree.AllVar = prune.accuracy,
DecisionTree.TrustVar = prune.accuracy.t,
DecisionTree.2TrustVar = prune.accuracy.o,
RandomForest.AllVar= rf.mat.overall.1.,
RandomForest.2TrustVar = rf.mat.t.overall.1.
)
ms
## DecisionTree.AllVar DecisionTree.TrustVar DecisionTree.2TrustVar
## Accuracy 0.8350448 0.8285454 0.8239958
## RandomForest.2TrustVar RandomForest.AllVar
## Accuracy 0.8238659 0.8593527
In conclusion, our findings indicated that while some demographic variables - country, age, education - are associated with vaccine hesitancy, the results from the decision trees and random forests suggest that trust in pharmaceutical firms and trust in science are the two forms of trust that most need to be addressed in understanding an individual's vaccine hesitancy.
4 Final Conclusions & Future Directions
Why are some people unwilling to take, or even opposed to, COVID-19 vaccines? We explored data from Twitter (tweets posted between 01 October 2020 and 31 March 2021) and from Eurofound (surveyed between 15 February 2021 and 30 March 2021) to answer this research question. Study 1, with the Twitter data, found that most anti-vaccination sentiment, which often coincides with vaccine hesitancy, is associated with discussions of conspiracy theories. The anti-vaccination tweets reflected a belief that elites or people with power control the world and that COVID-19 or the vaccine is part of their agenda to depopulate it. Study 2, with the survey data, showed that most vaccine hesitancy is associated with people's trust in pharmaceutical firms and in science and, to a lesser extent, with trust in the government.
It is interesting that the results of Study 1 and Study 2 resonate with the debate over the definitions of the terms anti-vaccination and vaccine hesitancy. For example, in the topic modeling of the Twitter data, we found words such as depopulation, Bill Gates, Dr. Fauci, test, and other conspiracy-theory-related terms, suggesting that anti-vaxxers tend to express distrust of a specific person or opinion. This narrow characterization of distrust reinforces Berman's (2020) argument that anti-vaxxers are a more extreme subset of people who are hesitant to vaccinate. Anti-vaccination and vaccine hesitancy can therefore differ considerably and may need to be treated separately. This points to a limitation of our study, in that we followed a definition of vaccine hesitancy that treats it as interchangeable with anti-vaccination. Since the data of Study 1 focused on anti-vaccination while the data of Study 2 covered a broader range of attitudes toward vaccination, translating results from Study 1 to Study 2 may be limited. Taking this into account, the same criticism applies to the research by Muric, Wu, and Ferrara (2021), who also use the two terms interchangeably.
Each study also has limitations. For Study 1, the tweets were pre-selected based on keywords about anti-vaccination, which could influence the results of the topic modeling, yielding similar topics. Moreover, in analyzing the Twitter data, we realized that some accounts with strong anti-vaccination sentiment might have dominated the discussion. For instance, across the full six months of data, the 359,688 tweets were posted by only 196,043 accounts, meaning each account posted 1.8 tweets on average. Consequently, the results of our topic analysis may be biased and cannot be fully translated into variables that explain an individual's vaccine hesitancy.
The survey data used in Study 2, on the other hand, offered a more direct way to discover variables that explain vaccine hesitancy. With questions specifically inquiring into people's attitudes toward the COVID-19 vaccine, we could see that higher trust in pharmaceutical firms and in science tends to be associated with a lower degree of vaccine hesitancy, and vice versa. However, the online format of the survey excluded the older generation from the study, and respondents might not have been candid in their answers; both leave room for bias. Another limitation of Study 2 is that, while the country variable appeared very important, we have not yet figured out why and how it matters for vaccine hesitancy. Moreover, we were unable to examine interaction effects among the predictors.
Nonetheless, we observe that the results from the topic model on the Twitter data and the supervised machine learning models on the survey data overlap to some degree. This suggests that increasing trust in institutions, including firms, government, and academia, is vital to mitigating vaccine hesitancy. As long as many people feel that a few elites or people with power maneuver the world, that trust will not grow and vaccine hesitancy will persist. Even though we cannot statistically compare the two sets of results due to differences in data and methods, this conclusion helps us think about the sources of vaccine hesitancy and how to combat it. For future research, we suggest delving further into the relationship between people's assessment of the government's performance, or of big companies' contribution to society, on the one hand, and the evolution of their trust in institutions on the other. We also suggest that future studies examine whether conspiracy theories influence individual acceptance of vaccines, and we advocate distinguishing anti-vaxxers from those who are merely hesitant to take a vaccine, especially when studying anti-vaccine movements or protests.
5 References
The authors adapted some code from "Lecture 5 - Exercises and Solutions" of the course Collecting and Analyzing Big Data for Social Sciences (professor Cecil Meeusen), as well as the documentation of the packages used in Study 1 (e.g., quanteda, seededlda, and stringr for regular expressions).
For the plots in Study 1, the authors were inspired by the blog post by Rachel Tatman (https://www.kaggle.com/code/rtatman/nlp-in-r-topic-modelling).
The authors were also inspired by code from the "Decision Trees, Bagging, and Random Forests" chapter of the Hands-On Machine Learning with R e-book.
When building the random forests in Study 2, the authors combined and tweaked code from Guru99 (author Daniel Johnson) and from the GeeksforGeeks article "Random Forest Approach in R Programming".